Introduction

We assume that the complete data-set consists of $(X, Z)$ but that only $X$ is observed. The maximum likelihood estimator (MLE) of the parameter $\theta$ is the vector $\hat{\theta}$ that solves the maximization problem \begin{align} \hat{\theta} = \underset{\theta}{\text{argmax}} \ \ln p(X|\theta) \nonumber \end{align}

The expectation maximization (EM) algorithm is a general technique for finding maximum likelihood solutions for probabilistic models having latent variables.

The implicit assumption underlying the EM algorithm is that it is difficult to optimize $\ln p(X|\theta)$ with respect to $\theta$, but that it is much easier to optimize $\ln p(X,Z|\theta)$. We introduce an arbitrary distribution, $q(Z)$, over the latent variables, Z, and note that we can decompose the log-likelihood into two terms as follows \begin{align} \ln p(X|\theta) &= \int_Z q(Z) \ln p(X|\theta) dZ \nonumber \\ &= \int_Z q(Z) \left(\ln p(X,Z|\theta) - \ln q(Z) - \ln p(Z|X, \theta) + \ln q(Z) \right) dZ \nonumber \\ &= \int_Z q(Z) \ln\left(\frac{p(X,Z|\theta)}{q(Z)}\right) dZ - \int_Z q(Z)\ln\left(\frac{p(Z|X,\theta)}{q(Z)}\right) dZ \nonumber\\ &\equiv \mathscr{L}(q,\theta) + \text{KL}(q||p_{Z|X}) \nonumber \end{align}
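
To make the decomposition concrete, here is a small numerical check with a discrete latent variable (the toy numbers are arbitrary and chosen only for illustration): whatever $q$ is, the lower bound $\mathscr{L}(q,\theta)$ and the KL term add up to the log-likelihood.

import numpy as np

rng = np.random.default_rng(1)
K = 3                                          # number of latent states
joint = rng.random(K) * 0.2                    # toy values of p(X = x, Z = k | theta)
px = joint.sum()                               # p(X = x | theta), marginalizing over Z
post = joint / px                              # p(Z | X = x, theta)

q = rng.random(K); q = q / q.sum()             # an arbitrary distribution q(Z)
elbo = (q * np.log(joint / q)).sum()           # L(q, theta)
kl = (q * np.log(q / post)).sum()              # KL(q || p(Z|X, theta))
print(np.isclose(elbo + kl, np.log(px)))       # True: the decomposition holds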

The EM algorithm is based on this decomposition. Starting from an initial parameter estimate $\theta^{\text{old}}$, it alternates between the following steps.

E-Step: Evaluate $p(Z|X,\theta^{\text{old}})$.

The E-step maximizes the lower bound, $\mathscr{L}(q,\theta^{\text{old}})$, with respect to $q(·)$ while keeping $\theta^{\text{old}}$ fixed. In principle this is a variational problem since we are optimizing a functional, but the solution is easily found.

Note that $\ln p(X|\theta)$ does not depend on $q(·)$. Thus, maximizing $\mathscr{L}(q,\theta^{\text{old}})$ is equivalent to minimizing $\text{KL}(q||p_{Z|X})$. Since $\text{KL}(q||p_{Z|X}) \ge 0$, with equality if and only if $q(Z) = p(Z|X,\theta^{\text{old}})$, the solution is $q(Z) = p(Z|X,\theta^{\text{old}})$.

M-Step: Evaluate $\theta^{\text{new}}$ given by \begin{align} \theta^{\text{new}} = \underset{\theta}{\text{argmax}} \ \mathscr{L}(q,\theta) = \underset{\theta}{\text{argmax}} \ \int_Z p(Z|X, \theta^{\text{old}}) \ln p(X,Z|\theta) dZ = \underset{\theta}{\text{argmax}} \ \mathbb{E}_{Z} \left[ \ln p(X,Z|\theta) \right] \nonumber \end{align}

with $q(Z) = p(Z|X,\theta^{\text{old}})$.

Iterative Step: Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, then let $$\theta^{\text{old}} \leftarrow \theta^{\text{new}}$$ and return to E-Step.
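
The scheme can be summarized as a short generic loop. The sketch below assumes the caller supplies three model-specific callables, e_step, m_step, and log_likelihood (hypothetical names); the Gaussian-mixture and Bernoulli-mixture implementations below follow this pattern.

import numpy as np

def em(X, theta, e_step, m_step, log_likelihood, tol=1e-9, max_iter=1000):
    # e_step(X, theta) returns q(Z) = p(Z | X, theta), e.g. as responsibilities
    # m_step(X, q) returns argmax_theta of the expected complete-data log likelihood
    # log_likelihood(X, theta) returns ln p(X | theta)
    ll_old = log_likelihood(X, theta)
    for _ in range(max_iter):
        q = e_step(X, theta)                  # E-Step
        theta = m_step(X, q)                  # M-Step
        ll_new = log_likelihood(X, theta)
        if np.abs(ll_new - ll_old) < tol:     # Iterative Step: convergence check
            break
        ll_old = ll_new
    return theta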

Theoretical Guarantees

The EM algorithm guarantees that the likelihood does not decrease at any iteration: $$ p(X|\theta^{\text{new}}) \ge p(X|\theta^{\text{old}}) $$

In general, the algorithm is not guaranteed to converge to a global or even a local maximum of the log-likelihood, because it can get stuck at saddle points.

Proof
\begin{align} \ln p(X|\theta) &= \int_Z p(Z|X, \theta^{\text{old}}) \ln p(X|\theta) dZ \nonumber \\ &= \int_Z p(Z|X, \theta^{\text{old}}) \ln p(X,Z|\theta) dZ - \int_Z p(Z|X, \theta^{\text{old}}) \ln p(Z|X,\theta) dZ \nonumber \\ &\equiv Q(\theta,\theta^{\text{old}}) - \int_Z p(Z|X, \theta^{\text{old}}) \ln p(Z|X,\theta) dZ \nonumber \end{align}\begin{align} \ln p(X|\theta^{\text{new}}) - \ln p(X|\theta^{\text{old}}) &= Q(\theta^{\text{new}},\theta^{\text{old}}) - Q(\theta^{\text{old}},\theta^{\text{old}}) - \int_Z p(Z|X, \theta^{\text{old}}) \ln \frac{p(Z|X,\theta^{\text{new}})}{p(Z|X,\theta^{\text{old}})} dZ \nonumber \\ & \stackrel{[1]}{\ge} Q(\theta^{\text{new}},\theta^{\text{old}}) - Q(\theta^{\text{old}},\theta^{\text{old}}) - \ln \int_Z p(Z|X, \theta^{\text{old}}) \frac{p(Z|X,\theta^{\text{new}})}{p(Z|X,\theta^{\text{old}})} dZ \nonumber \\ &= Q(\theta^{\text{new}},\theta^{\text{old}}) - Q(\theta^{\text{old}},\theta^{\text{old}}) \stackrel{[2]}{\ge} 0 \nonumber \\ \end{align}

[1] holds because of Jensen's inequality (the logarithm is concave). The remaining integral equals $\int_Z p(Z|X,\theta^{\text{new}}) dZ = 1$, so its logarithm is zero. [2] holds because $\theta^{\text{new}} = \text{argmax}_{\theta} \ Q(\theta,\theta^{\text{old}})$.


Mixtures of Gaussians

Let $\mathbf{x}$ follow a Gaussian mixture distribution, $$ p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) $$

where the mixing coefficients satisfy $$ 0 \le \pi_k \le 1 \text{ and } \sum_{k=1}^{K} \pi_k = 1 $$

Let $X$ represent the set of observations $\{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$. The log of the likelihood function is given by $$ \ln p(X|\theta) = \sum_{i=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(\boldsymbol{x}_i |\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\} $$ where the parameter $\theta = (\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$. Setting the derivatives of $\ln p(X|\theta)$ with respect to each element of $\theta$ to zero, we obtain \begin{align} \boldsymbol{\mu}_{k} &= \frac{1}{N_k}\sum_{i=1}^N \gamma_{ik} \mathbf{x}_i \label{gmm1} \\ \boldsymbol{\Sigma}_k &= \frac{1}{N_k}\sum_{i=1}^N \gamma_{ik} (\mathbf{x}_i - \boldsymbol\mu_{k})(\mathbf{x}_i - \boldsymbol\mu_{k})^\intercal \label{gmm2} \\ \pi_k &= \frac{N_k}{N} \label{gmm3} \end{align}

where $N_k = \sum_{i=1}^N \gamma_{ik}$ and $$ \gamma_{ik} = \frac{\pi_k \mathcal{N}(\mathbf{x}_i|\boldsymbol \mu_k, \boldsymbol \Sigma_k)}{\sum_{l=1}^K\pi_l \mathcal{N}(\mathbf{x}_i|\boldsymbol \mu_l, \boldsymbol \Sigma_l)}. $$

Note that, for the derivative with respect to $\boldsymbol{\pi}$, we must take account of the constraint $\sum_{k=1}^{K} \pi_k = 1$ by introducing a Lagrange multiplier. It is worth emphasizing that the equations \eqref{gmm1}-\eqref{gmm3} do not constitute a closed-form solution for the parameters of the mixture model, because the responsibilities $\gamma_{ik}$ depend on those parameters in a complex way. In fact, there is a significant problem with the maximum likelihood framework applied to Gaussian mixture models, due to the presence of singularities.
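
The circular dependence is apparent when the updates are written as code: computing the responsibilities needs the current parameters, and updating the parameters needs the responsibilities, so the two steps must be alternated. Below is a minimal numpy sketch (the function names are mine; the full demo appears later in this note).

import numpy as np
from scipy.stats import multivariate_normal

def gmm_e_step(X, pi, mu, sigma):
    # responsibilities gamma_ik, shape (N, K)
    dens = np.stack([pi[k] * multivariate_normal(mean=mu[k], cov=sigma[k]).pdf(X)
                     for k in range(len(pi))], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

def gmm_m_step(X, gamma):
    Nk = gamma.sum(axis=0)                    # N_k = sum_i gamma_ik
    mu = (gamma.T @ X) / Nk[:, None]          # update for mu_k
    sigma = np.stack([(gamma[:, k, None] * (X - mu[k])).T @ (X - mu[k]) / Nk[k]
                      for k in range(gamma.shape[1])])   # update for Sigma_k
    pi = Nk / X.shape[0]                      # update for pi_k
    return pi, mu, sigma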

Singularities

One of the Gaussian components can "collapse" onto a specific data point, driving the likelihood to infinity. These singularities demonstrate the severe over-fitting that can occur in a maximum likelihood approach.

Illustration

For the purposes of illustration, we assume $\boldsymbol{\Sigma}_k = \sigma^2_k I$ and let $p$ denote the dimension of $\mathbf{x}$, so that $\mathcal{N}(\mathbf{x}|\boldsymbol\mu_k, \boldsymbol\Sigma_k) = (2\pi)^{-p/2}\sigma_k^{-p}\exp\left(-||\mathbf{x}-\boldsymbol\mu_k||^2/2\sigma_k^2\right)$. We will show that setting $\boldsymbol{\mu}_1 = \boldsymbol{x}_j$ for some $j \in \{1,\ldots,N\}$ and letting $\sigma_1 \to 0$ drives the likelihood function to infinity. \begin{align} \ln p(X|\theta) &= \sum_{i=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \mathcal{N}(\boldsymbol{x}_i |\boldsymbol{\mu}_k,\boldsymbol {\Sigma}_k) \right\} \nonumber \\ &= \sum_{i=1}^{N} \ln \left\{ \sum_{k=1}^{K} \frac{\pi_k}{(2 \pi)^{p/2}\sigma_k^p} \exp \left( - \frac{||\mathbf{x}_i -\boldsymbol{\mu}_k||^2}{2 \sigma_k^2} \right) \right\} \nonumber \\ &= \sum_{i=1}^{N} \ln \left\{ \frac{\pi_1}{(2 \pi)^{p/2}\sigma_1^p} \sum_{k=1}^{K} \frac{\pi_k}{\pi_1} \frac{\sigma_1^p}{\sigma_k^p} \exp \left( - \frac{||\mathbf{x}_i -\boldsymbol{\mu}_k||^2}{2 \sigma_k^2} \right) \right\} \nonumber \\ &= N \ln \left\{ \frac{\pi_1}{(2 \pi)^{p/2}\sigma_1^p} \right\} + \sum_{i=1}^{N} \ln \left\{ \sum_{k=1}^{K} \frac{\pi_k}{\pi_1} \frac{\sigma_1^p}{\sigma_k^p} \exp \left( - \frac{||\mathbf{x}_i -\boldsymbol{\mu}_k||^2}{2 \sigma_k^2} \right) \right\} \nonumber \\ &= N \ln \left\{ \frac{\pi_1}{(2 \pi)^{p/2}\sigma_1^p} \right\} + \ln \left\{ 1 + \sum_{k=2}^{K} \frac{\pi_k}{\pi_1} \frac{\sigma_1^p}{\sigma_k^p} \exp \left( - \frac{||\mathbf{x}_j -\boldsymbol{\mu}_k||^2}{2 \sigma_k^2} \right) \right\} \nonumber \\ & \quad \quad + \sum_{i=1,\ne j}^{N} \ln \left\{ \underbrace{\exp \left( - \frac{||\mathbf{x}_i -\boldsymbol{\mu}_1||^2}{2 \sigma_1^2} \right)}_{A} + \underbrace{\sum_{k=2}^{K} \frac{\pi_k}{\pi_1} \frac{\sigma_1^p}{\sigma_k^p} \exp \left( - \frac{||\mathbf{x}_i -\boldsymbol{\mu}_k||^2}{2 \sigma_k^2} \right)}_{B} \right\} \nonumber \end{align}

Consider the last expression as $\sigma_1 \to 0$. The first term behaves like $-Np\ln\sigma_1$ and goes to $+\infty$. The second term tends to $\ln 1 = 0$. In each summand of the third term, $A \to 0$ while $B$ behaves like a positive constant times $\sigma_1^p$ (the components $k \ge 2$ place strictly positive density on every data point), so the third term behaves like $(N-1)p\ln\sigma_1$ plus a bounded quantity. Adding the three terms, $\ln p(X|\theta)$ behaves like $-p\ln\sigma_1$ up to bounded terms and therefore goes to $+\infty$ when $\sigma_1 \to 0$.

However, if $K=1$, then $B$ vanishes and the third term becomes $$ -\sum_{i=1,\ne j}^{N} \frac{||\mathbf{x}_i -\boldsymbol{\mu}_1||^2}{2 \sigma_1^2} $$

which goes to $-\infty$ when $\sigma_1 \to 0$, and it does so faster than the first term goes to $+\infty$ (the former is of order $1/\sigma_1^2$, the latter of order $\ln(1/\sigma_1)$). Hence, the proposed solution does not maximize the log-likelihood function when $K=1$, which is why we do not encounter the issue with a single Gaussian.
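
The collapse is easy to reproduce numerically. The sketch below uses an arbitrary toy data set, fixes $\boldsymbol\mu_1$ at the first data point, and shrinks $\sigma_1$: with $K = 2$ the log-likelihood grows without bound, while with $K = 1$ it plunges towards $-\infty$, as argued above.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                       # toy data, p = 2

def gmm_loglik(X, pi, mu, sigma2):
    # log p(X | theta) for spherical covariances Sigma_k = sigma2[k] * I, computed in log space
    logdens = np.stack([multivariate_normal(mean=mu[k], cov=sigma2[k] * np.eye(2)).logpdf(X)
                        for k in range(len(pi))])
    return logsumexp(np.log(pi)[:, None] + logdens, axis=0).sum()

for s in [1e-1, 1e-2, 1e-3]:
    ll2 = gmm_loglik(X, [0.5, 0.5], [X[0], X.mean(axis=0)], [s**2, 1.0])   # component 1 collapses onto x_1
    ll1 = gmm_loglik(X, [1.0], [X[0]], [s**2])                              # single Gaussian centred at x_1
    print(f"sigma_1 = {s:g}:  K=2 log-lik = {ll2:10.1f}   K=1 log-lik = {ll1:12.1f}")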


We introduce a latent variable $\mathbf{z} \in \{0,1\}^K$ such that $\sum_{k=1}^{K} z_k = 1$. The marginal distribution of each component is $p(z_k = 1) = \pi_k$, so the distribution of $\mathbf{z}$ can be written as $$ p(\mathbf{z}) = \prod_{k=1}^{K} \pi_{k}^{z_k} $$

The Gaussian mixture model is then recovered as the marginal \begin{align} p(\mathbf{x}) = \sum_{\mathbf{z}}p(\mathbf{z})p(\mathbf{x}|\mathbf{z}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \nonumber \end{align}

The log likelihood function of the joint distribution of $(X,Z)$ takes the form $$ \ln p(X,Z|\theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \left\{ \ln \pi_k + \ln \mathcal{N}(\boldsymbol{x}_i |\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\} $$

The objective function in the M-Step is $$ \mathbb{E}_{Z} \left[ \ln p(X,Z|\theta) \right] = \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{E}_{Z} [z_{ik}] \left\{ \ln \pi_k + \ln \mathcal{N}(\boldsymbol{x}_i |\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\} $$

In the general EM algorithm framework, $p(Z|X,\theta)$ is evaluated in the E-Step so that $\mathbb{E}_{Z} \left[ \ln p(X,Z|\theta) \right]$ can be calculated in the M-Step. Due to the simpler form of $\mathbb{E}_{Z} \left[ \ln p(X,Z|\theta) \right]$ in the mixture of Gaussians model, instead of evaluating the full distribution $p(Z|X,\theta)$, we only need to evaluate \begin{align} \mathbb{E}_{Z} [z_{ik}] = p(z_{ik}=1|\mathbf{x}_i, \theta^{\text{old}}) \nonumber \end{align}

in the E-Step, where the equality holds because $z_{ik}$ is binary (so its expectation equals the probability that it is 1) and the posterior of $\mathbf{z}_i$ depends only on $\mathbf{x}_i$. The EM algorithm for Gaussian mixtures is as follows:

  1. Initialize the means $\boldsymbol \mu_k$, covariances $\boldsymbol \Sigma_k$ and mixing coefficients $\pi_k$, and evaluate the initial value of the log likelihood

  2. E Step: Evaluate the responsibilities using the current parameter values $$ p(z_{ik}=1|{\mathbf x}_i) = \gamma(z_{ik}) = \frac{\pi_k \mathcal{N}({\mathbf x}_i|\boldsymbol \mu_k, \boldsymbol \Sigma_k)}{\sum_{l=1}^K \pi_l \mathcal{N}({\mathbf x}_i|\boldsymbol \mu_l, \boldsymbol \Sigma_l)} $$

  3. M Step: Re-estimate the parameters using the current responsibilities \begin{align} \boldsymbol\mu_{k}^{\text{new}} &= \frac{1}{N_k}\sum_{i=1}^N \gamma(z_{ik}) {\mathbf x}_i \nonumber \\ \boldsymbol{\Sigma}_k^{\text{new}} &= \frac{1}{N_k}\sum_{i=1}^N \gamma(z_{ik}) ({\mathbf x}_i - \boldsymbol\mu_{k}^{\text{new}})({\mathbf x}_i - \boldsymbol\mu_{k}^{\text{new}})^\intercal \nonumber\\ \pi_k^{\text{new}} &= \frac{N_k}{N} \nonumber \end{align}

where $N_k = \sum_{i=1}^N \gamma(z_{ik})$.

  4. Evaluate the log likelihood and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2. $$ \log p(X|\boldsymbol\pi, \boldsymbol\mu, \boldsymbol\Sigma) = \sum_{i=1}^N \log\left(\sum_{k=1}^K \pi_k \mathcal{N}({\mathbf x}_i|\boldsymbol \mu_k, \boldsymbol \Sigma_k)\right) $$

Note:

  • The covariance matrices can conveniently be initialized to the sample covariances of the clusters found by the K-means algorithm, and
  • the mixing coefficients can be set to the fractions of data points assigned to the respective clusters.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from sklearn.datasets import make_biclusters

X, y, _ = make_biclusters((200, 2), 2, noise=0.24, random_state=314, minval=0, maxval=1)
# X contains 200 points and each row is a point
# y.shape = (2, 200) each row is for a cluster where True indicates a point is in this cluster 
# y.argmax(axis=0) generates the cluster id for each point
y = y.argmax(axis=0)
plt.scatter(*X.T, c=y, s=10, cmap='winter')
plt.show()
K = 2
pi = 0.5*np.ones(K)
# we deliberately choose poor initial values for the cluster centres
# so that the algorithm takes several steps before convergence.
mu = [[-0.2,   1],
      [   1, 0.3]]
sigma = np.repeat([np.identity(2)/20], K, axis=0)  # K spherical 2x2 covariance matrices

hist = []
while True:
    # E-Step: responsibilities, shape (K, N)
    resp = np.array([pi[i]*multivariate_normal(mean=mu[i], cov=sigma[i]).pdf(X) for i in range(K)])
    resp = resp / resp.sum(axis=0)

    # Evaluate the log likelihood
    log_likelihood = \
        np.log(np.sum([pi[i]*multivariate_normal(mean=mu[i], cov=sigma[i]).pdf(X) for i in range(K)], axis=0)).sum()
    hist.append((resp, pi, mu, sigma, log_likelihood))
    
    # M-Step
    Nk = resp.sum(axis=1)                  # effective number of points per component, shape (K,)
    pi = Nk / Nk.sum()                     # new mixing coefficients
    mu = np.divide((resp @ X).T, Nk).T     # new means, shape (K, 2)
    sigma = [(resp[i].reshape(-1,1) * (X-mu[i])).T @ (X-mu[i]) / Nk[i] for i in range(K)]  # new covariances
    
    if len(hist)>=2 and np.abs(hist[-1][-1] - hist[-2][-1])<1e-9:
        break

# show the animation        
def plot_mixtures(X, resp, pi, mu, sigma, step=0.01):
    plt.scatter(*X.T, alpha=0.7, c=resp[0], s=20, cmap='winter')
    x, y = np.mgrid[plt.xlim()[0]:plt.xlim()[1]:step, plt.ylim()[0]:plt.ylim()[1]:step]
    grid = np.dstack((x, y))
    for i in range(K):
        plt.contour(x, y, multivariate_normal(mean=mu[i], cov=sigma[i]).pdf(grid).reshape(x.shape), levels=3)

import matplotlib.animation as animation
from IPython.display import HTML
fig, ax = plt.subplots()

def animate(frame_num):
    plt.cla()
    resp, pi, mu, sigma = hist[frame_num][:-1]
    plot_mixtures(X, resp, pi, mu, sigma)
    plt.title(f"EM Iteration: {frame_num}")

niters = len(hist)
anim = animation.FuncAnimation(fig, animate, niters, interval=300)
# anim.save("gmm.gif", fps=9, dpi=150, writer="Pillow")
display(HTML(anim.to_jshtml()))
plt.close()        
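
As a quick check of the monotonicity guarantee from the first section, the log likelihoods recorded in hist should form a non-decreasing sequence (up to floating-point noise):

lls = [h[-1] for h in hist]            # log likelihood recorded at each iteration
assert all(b - a >= -1e-8 for a, b in zip(lls, lls[1:]))
plt.plot(lls, marker="o")
plt.xlabel("EM iteration")
plt.ylabel("log likelihood")
plt.show()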

$K$-means Clustering

$K$-means clustering finds an assignment of the data points $\{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$ to $K$ clusters, where the set of vectors $\{\boldsymbol\mu_1,\ldots,\boldsymbol\mu_K\}$ represents the cluster centres. This is equivalent to minimizing the objective function $$ J = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} ||\mathbf{x}_i - \boldsymbol\mu_k||^2 $$

where the $r_{ik}$ follow the 1-of-$K$ coding scheme: $r_{ik} = 1$ if data point $\mathbf{x}_i$ is assigned to cluster $k$ and $r_{ik} = 0$ otherwise.

Consider a Gaussian mixture model in which the covariance matrices of the mixture components are given by $\epsilon I$, where $\epsilon$ is a common variance parameter. In the E step, we evaluate, for each $i \in \{1,\ldots,N\}$, \begin{align} \gamma_{ik} = \gamma(z_{ik}) &= \frac{\pi_k \mathcal{N}({\mathbf x}_i|\boldsymbol \mu_k, \boldsymbol \Sigma_k)}{\sum_{l=1}^K \pi_l \mathcal{N}({\mathbf x}_i|\boldsymbol \mu_l, \boldsymbol \Sigma_l)} = 1 \big/ \left( 1 + \sum_{l=1,\ne k}^K \frac{\pi_l}{\pi_k} \frac{\mathcal{N}({\mathbf x}_i|\boldsymbol \mu_l, \boldsymbol \Sigma_l)}{\mathcal{N}({\mathbf x}_i|\boldsymbol \mu_k, \boldsymbol \Sigma_k)} \right) \nonumber \\ & = 1 \big/ \left( 1 + \sum_{l=1,\ne k}^K \frac{\pi_l}{\pi_k} \frac{\exp (-||\mathbf{x}_i - \boldsymbol{\mu}_l||^2/2\epsilon)}{\exp (-||\mathbf{x}_i - \boldsymbol{\mu}_k||^2/2\epsilon)} \right) \nonumber \\ & = 1 \big/ \left( 1 + \sum_{l=1,\ne k}^K \frac{\pi_l}{\pi_k} \exp \left((||\mathbf{x}_i - \boldsymbol{\mu}_k||^2-||\mathbf{x}_i - \boldsymbol{\mu}_l||^2)/2\epsilon \right) \right) \nonumber \end{align}

Let $\hat k = \arg\min_k||{\mathbf x}_i - {\boldsymbol\mu}_k||^2$. In the E-Step, we have $$ \gamma_{i k} = \begin{cases} 1 & \text{if } k = \hat k \\ 0 & \text{otherwise} \end{cases} $$

when $\epsilon \to 0$. In the M-Step, we have \begin{align} \text{argmax} \mathbb{E}_{Z} \left[ \ln p(X,Z|\theta) \right] &= \text{argmax} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{E}_{Z} [z_{ik}] \left\{ \ln \pi_k + \ln \mathcal{N}(\boldsymbol{x}_i |\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\} \nonumber\\ &\rightarrow \text{argmax} -\frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma_{ik} ||\mathbf{x}_i - \boldsymbol{\mu}_k||^2 \nonumber \end{align}

with the responsibilities $\gamma_{ik}$, which play the role of the hard assignments $r_{ik}$, fixed to the values obtained in the E-Step. Hence, the EM steps reduce to the two steps of the $K$-means algorithm:

E-Step: Minimize $J$ w.r.t. $\{r_{ik}\}_{i,k}$ while keeping $\{{\boldsymbol\mu}_k\}_{k}$ fixed. That is, assign each data point to the closest prototype: $$ r_{i k} = \begin{cases} 1 & \text{if } k = \hat k \\ 0 & \text{otherwise} \end{cases} $$

M-Step: Minimize $J$ w.r.t. $\{{\boldsymbol\mu}_k\}_{k}$ while keeping $\{r_{ik}\}_{i,k}$ fixed. Hence, we have, for cluster $k$, $$ \boldsymbol\mu_{k} = \frac{1}{N_{k}} \sum_{i:\, r_{ik}=1}{\mathbf x}_i $$ where $N_k$ is the number of data points assigned to cluster $k$.

Note that the K-means algorithm does not estimate the covariances of the clusters but only the cluster means.
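
Below is a minimal numpy sketch of the two steps (the function and variable names are mine, and the sketch assumes no cluster becomes empty); its output could also serve as the K-means-based initialization for the Gaussian mixture mentioned earlier.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]              # initial centres: K distinct data points
    for _ in range(n_iter):
        # E-Step: assign each point to its nearest centre (hard assignments r_ik)
        d2 = ((X[:, None, :] - mu[None, :, :])**2).sum(axis=-1)    # squared distances, shape (N, K)
        labels = d2.argmin(axis=1)
        # M-Step: move each centre to the mean of its assigned points
        new_mu = np.stack([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels

For example, kmeans(X, K=2) applied to the data of the earlier Gaussian-mixture demo returns cluster centres and hard assignments, from which sample covariances and mixing fractions can then be computed.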

Mixtures of Bernoulli Distributions

Consider a set of $D$ binary variables $\mathbf{x} = \{x_d\}_{d=1}^D$, jointly modelled by a mixture of $K$ Bernoulli distributions, so that \begin{align} p({\mathbf x}|\boldsymbol \mu, \pmb \pi) = \sum_{k=1}^{K} \pi_k p({\mathbf x}|\pmb{\mu}_k) \label{mbd} \end{align}

where $$ p({\mathbf x}|\pmb{\mu}_k) = \prod_{d=1}^D p(x_d|\mu_{kd}) = \prod_{d=1}^D \mu_{kd}^{x_d}(1 - \mu_{kd})^{(1-x_d)} $$

where, under component $k$, each variable $x_d$ follows a Bernoulli distribution and $\mu_{kd}$ is the probability that $x_d=1$.

As in the case of the Gaussian mixture, we introduce a latent variable $\mathbf{z} = (z_1,\ldots, z_K)^\intercal$, which is a binary $K$-dimensional variable having a single component equal to 1, with all other components equal to 0. We have \begin{align} p({\mathbf x}|\mathbf{z}, \pmb{\mu}) &= \prod_{k=1}^{K} p({\mathbf x}|\pmb{\mu}_k)^{z_k} \text{ and } p(\mathbf{z} | \pmb{\pi}) = \prod_{k=1}^{K} \pi_{k}^{z_k} \end{align}

If we form the product of $p({\mathbf x}|\mathbf{z}, \pmb{\mu})$ and $p(\mathbf{z} | \pmb{\pi})$ and then marginalize over $\mathbf{z}$, we recover \eqref{mbd}.

For data $(\mathbf{x}_i, \mathbf{z}_i)_{i=1,\ldots,N}$, the expectation of the log likelihood function with respect to the posterior distribution of the latent variables is \begin{align} \mathbb{E}_{Z} \left[\ln p(X, Z|\pmb{\mu}, \pmb{\pi})\right] = \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{E}_{Z}(z_{ik}) \left( \ln \pi_k + \sum_{d=1}^{D} \left[ x_{id} \ln\mu_{kd} + (1-x_{id}) \ln(1-\mu_{kd})\right] \right) \end{align}

In the E-Step, we compute $$ \mathbb{E}_Z[z_{ik}] = \gamma(z_{ik}) = \frac{\pi_{k}p({\mathbf{x}_i}|\pmb{\mu}_{k})}{\sum_{l=1}^K \pi_{l}p({\mathbf{x}_i}|\pmb{\mu}_{l})} $$

In the M-Step, we maximize $\mathbb{E}_{Z} \left[\ln p(X, Z|\pmb{\mu}, \pmb{\pi})\right]$ with respect to the parameters $\boldsymbol\pi$ and $\boldsymbol \mu$, and arrive at the equations: \begin{align} \pmb\mu_k &= \bar{\mathbf x}_k = \frac{1}{N_k}\sum_{i=1}^N \gamma(z_{ik})\mathbf{x}_i \nonumber \\ \pi_k &= \frac{N_k}{N} \nonumber \\ N_k &= \sum_{i=1}^N \gamma(z_{ik}) \nonumber \end{align}

The log likelihood function of the data $\{\mathbf{x}_i\}_{1,\ldots,N}$ is $$ \ln p(X|\pmb{\mu}, \pmb{\pi}) = \sum_{i=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k p(\mathbf{x}_i|\pmb{\mu}_k) \right) $$

In our digit image example, $\mathbf{x}$ is a binary array with 784 ($28\times28$) elements.

from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

select = np.array([1,2,3])
# select = np.arange(10)
idx = (y_train[:, None]==select).any(axis=1)

# convert X_train to Bernoulli (i.e. black or white)
# original data allows different degree of gray
X_train, y_train = ((X_train[idx] / 255) > 0.5) * 1, y_train[idx]
X = X_train.reshape(X_train.shape[0], -1)

K = select.shape[0]
N = X.shape[0]
D = X.shape[1]

np.random.seed(314)
pi = np.ones(K) / K
mu = np.random.uniform(low=0.25, high=0.75, size=(K, D))
hist = [(0, mu)]

while True:
    # E-Step: evaluate the responsibilities
    p = (mu**X[:, None, :] * (1 - mu)**(1 - X[:, None, :]))   # Bernoulli factors, shape (N, K, D)
    r = p.prod(axis=-1) * pi                                   # pi_k * p(x_i|mu_k), shape (N, K)
    r = r / r.sum(axis=-1, keepdims=True)                      # responsibilities gamma(z_ik)

    # Evaluate the log likelihood under the current parameters
    log_likelihood = np.log((p.prod(axis=-1) * pi).sum(axis=-1)).sum()

    # M-Step: re-estimate the parameters
    Nk = r.sum(axis=0)                 # effective number of images per component
    mu = (r.T @ X) / Nk[:, None]       # new Bernoulli means
    pi = Nk / N                        # new mixing coefficients

    hist.append((log_likelihood, mu))
    
    if len(hist)>=2 and np.abs((hist[-1][0] - hist[-2][0])/hist[-1][0])<1e-5:
        break

# plotting 4 steps from the beginning to see how EM converges
nsteps = min(len(hist), 4) 
fig, ax = plt.subplots(nsteps, K, figsize=(3, 5))
for i, ax_row in enumerate(ax):
    for k, axi in enumerate(ax_row):
        axi.imshow(hist[i][1][k].reshape(*X_train.shape[-2:]), cmap="bone")
        axi.axis("off");